Sequential download uses an in-memory cache for coalesced range within a segment #558
Conversation
hoytak left a comment
LGTM, though I'm worried a bit about memory use when there are a lot of parallel downloads. Maybe we should change some defaults for the cache size?
debug!(call_id, num_tasks = terms.len(), "enqueueing download tasks");

// in-memory cache for this segment
let coalesced_range_reuse_cache = Arc::new(MemoryCache::default());
Like @hoytak, I'm worried this could suck up too much RAM since it can keep growing.
I think capping the overall byte size of the cache would be good - maybe the default is 1 GB but configurable via an env variable.
Instead of random eviction, do we want to evict based on a min-heap of num_bytes_in_segment? Meaning, let's evict the smallest segments, since that would minimize the number of network requests.
Also, I think we should probably log a bit more so we know how many times a segment is reused. That is likely a nice-to-have.
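A minimal sketch of such a byte-capped cache, assuming a hypothetical `HF_XET_RANGE_CACHE_BYTES` env variable (not an existing setting) and the 1 GB default floated above:

```rust
use std::collections::HashMap;

// Illustrative only; names and key type are invented for the example.
struct CappedMemoryCache {
    max_bytes: usize,
    total_bytes: usize,
    entries: HashMap<u64, Vec<u8>>, // keyed by some range/xorb identifier
}

impl CappedMemoryCache {
    fn new() -> Self {
        // Hypothetical env var; 1 GB default mirrors the number suggested above.
        let max_bytes = std::env::var("HF_XET_RANGE_CACHE_BYTES")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(1 << 30);
        Self { max_bytes, total_bytes: 0, entries: HashMap::new() }
    }

    fn put(&mut self, key: u64, data: Vec<u8>) {
        // Evict until the new entry fits under the byte cap
        // (which entry to evict is debated further down in this thread).
        while self.total_bytes + data.len() > self.max_bytes {
            let victim = match self.entries.keys().next() {
                Some(&k) => k,
                None => break,
            };
            if let Some(evicted) = self.entries.remove(&victim) {
                self.total_bytes -= evicted.len();
            }
        }
        self.total_bytes += data.len();
        self.entries.insert(key, data);
    }
}
```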
I don't think evicting based on num_bytes_in_segment would work. The cache would fill up with large chunks, and any small chunks in it would get evicted immediately, before they could be used.
A better heuristic here is perhaps insertion time. However, I think the memory issue can possibly be resolved by having a common cache with a larger limit instead of the per-download cache, or simply by changing that default to be smaller, reflecting the limit on the number of parallel downloads.
Isn't the main optimization goal behind adding a memory cache to limit the number of network requests needed to reconstruct a file? By minimizing that we reduce overall download time and ensure we don't download more bytes than we need (repeatedly downloading the same segment).
So with an upper bound on cache size we need to evict entries. The entries we evict should be the least used & smallest segments; evicting those results in the fewest extra network requests.
Since we get all the reconstruction info at the beginning of the download, we can do the bookkeeping to see which segments have the most terms, and maybe weight those segments by overall byte count when inserting into the cache. Then, as we consume from the cache, we decrement the remaining terms for a segment and remove it from the cache once the segment is fully consumed. When evicting, we evict the segments with the least usage and fewest bytes, preserving the biggest segments with the most remaining terms.
Yes, this will result in the cache being filled up with large chunks, but those large chunks also have lots of remaining terms to be consumed, so they're worth keeping. And the large chunks will be removed from the cache once all their terms are consumed anyway.
The map would be something like:
(xorb_id : (num_terms_remaining, weighted_segment_score, bytes))
weighted_segment_score = (segment_bytes / HF_XET_NUM_RANGE_IN_SEGMENT_BASE) / num_terms
That is, segment_bytes as a fraction of the configurable segment size, divided by the number of terms it satisfies.
The entry with the smallest weighted score should be evicted first.
This way, on a cache hit, once num_terms_remaining hits 0 the entire cache entry is removed.
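A minimal sketch of the bookkeeping described above, with invented names rather than the actual hf-xet types, assuming a plain HashMap keyed by xorb id:

```rust
use std::collections::HashMap;

struct SegmentEntry {
    num_terms_remaining: u32,
    weighted_segment_score: f64, // (segment_bytes / segment_size_base) / num_terms
    data: Vec<u8>,
}

#[derive(Default)]
struct SegmentCache {
    entries: HashMap<u64, SegmentEntry>, // keyed by xorb id (simplified)
}

impl SegmentCache {
    fn insert(&mut self, xorb_id: u64, num_terms: u32, segment_size_base: u64, data: Vec<u8>) {
        let score = (data.len() as f64 / segment_size_base as f64) / num_terms as f64;
        self.entries.insert(xorb_id, SegmentEntry {
            num_terms_remaining: num_terms,
            weighted_segment_score: score,
            data,
        });
    }

    // On a cache hit, decrement the remaining-term count and drop the entry
    // once the segment has been fully consumed.
    fn consume(&mut self, xorb_id: u64) -> Option<Vec<u8>> {
        let entry = self.entries.get_mut(&xorb_id)?;
        entry.num_terms_remaining -= 1;
        let data = entry.data.clone();
        if entry.num_terms_remaining == 0 {
            self.entries.remove(&xorb_id);
        }
        Some(data)
    }

    // When over budget, evict the entry with the smallest weighted score first.
    fn evict_one(&mut self) {
        let victim = self
            .entries
            .iter()
            .min_by(|a, b| a.1.weighted_segment_score.total_cmp(&b.1.weighted_segment_score))
            .map(|(&id, _)| id);
        if let Some(id) = victim {
            self.entries.remove(&id);
        }
    }
}
```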
// the requested range cannot be larger than the fetched range, and cache the fetched range
// because it will be reused within the segment.
// "else" case data matches exact, save some work, return whole data.
if self.term.range.start != chunk_range.start || self.term.range.end != chunk_range.end {
This condition captures some of the cases but not all. It helps with the case we're dealing with, but it's not general for everything.
True, but I think it's a pretty solid heuristic here -- the only case it doesn't capture is when the downloaded range is referenced in full here and then only partially referenced later. But I think for now, best-effort is probably fine.
 * effectively taking [skip_bytes..skip_bytes+take]
 * out of the downloaded range */
#[derivative(Debug = "ignore")]
pub coalesced_range_reuse_cache: Arc<dyn ChunkCache>,
I wish there were a way to use one cache passed in here; there's already one in FetchTermDownload.
This is definitely a trade-off. Using a disk cache brings down memory usage, but as Adrien and Corentin mentioned, a disk cache slows down their environment a LOT. I don't think there is a universal solution for all environments, and this approach is more tailored to their usage; I don't think using an in-memory cache is suitable for all hf-xet users either.
No evictions, so this could get large. I think it's fair to assume that for fragmented files we will have all of the ranges in here.
struct MemoryCacheItem {
    range: ChunkRange,
    chunk_byte_indices: Vec<u32>,
    data: Vec<u8>,
Can we use Bytes here?
What would be the benefit of Bytes over Vec w.r.t. implementing this ChunkCache trait?
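For context on the Bytes question, a minimal sketch (not the cache code itself) of the practical difference: bytes::Bytes is a reference-counted buffer, so clones and sub-range slices handed out on cache hits share one allocation, whereas a Vec<u8> has to be copied (or moved out) each time.

```rust
use bytes::Bytes;

fn main() {
    let downloaded: Vec<u8> = vec![0u8; 16 * 1024 * 1024]; // pretend this is a fetched range

    // Vec<u8>: every consumer that needs its own copy pays for a full memcpy.
    let vec_copy = downloaded.clone();

    // Bytes: converting once takes ownership; subsequent clones and slices are
    // cheap reference-count bumps over the same allocation.
    let shared = Bytes::from(downloaded);
    let hit = shared.clone();                  // no copy
    let sub_range = shared.slice(1024..2048);  // no copy, just a view

    assert_eq!(vec_copy.len(), 16 * 1024 * 1024);
    assert_eq!(hit.len(), shared.len());
    assert_eq!(sub_range.len(), 1024);
}
```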
With the disk cache turned off, when multiple ranges are coalesced into one larger download range (i.e. one pre-signed URL), that range is downloaded multiple times without being cached. This PR adds an in-memory cache used only for these coalesced ranges. The cache is per-segment, meaning it is dropped after the entire segment is fully reconstructed, because reconstruction info is returned segment-wise and only adjacent or overlapping ranges within a segment are coalesced.
The memory cache implementation is partly adapted from ec66175.
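As a rough illustration of the flow described above (a simplified sketch with invented names and signatures, not the actual hf-xet code): each segment gets its own in-memory map keyed by the coalesced range, the download path checks it before issuing another ranged request, and the whole map is dropped once the segment is reconstructed.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical key: (xorb id, start chunk, end chunk) of the coalesced range.
type CoalescedKey = (u64, u32, u32);

#[derive(Default)]
struct SegmentRangeCache {
    ranges: Mutex<HashMap<CoalescedKey, Arc<Vec<u8>>>>,
}

impl SegmentRangeCache {
    // Return the cached coalesced range, or fetch it once and cache it for the
    // remaining terms in this segment.
    fn get_or_fetch(
        &self,
        key: CoalescedKey,
        fetch: impl FnOnce() -> Vec<u8>, // stand-in for the ranged HTTP download
    ) -> Arc<Vec<u8>> {
        if let Some(hit) = self.ranges.lock().unwrap().get(&key) {
            return hit.clone();
        }
        let data = Arc::new(fetch());
        self.ranges.lock().unwrap().insert(key, data.clone());
        data
    }
}

fn reconstruct_segment(terms: &[CoalescedKey]) {
    // One cache per segment: it is dropped when this function returns,
    // i.e. once the segment has been fully reconstructed.
    let cache = SegmentRangeCache::default();
    for &term in terms {
        let _range = cache.get_or_fetch(term, || vec![0u8; 1024]); // dummy "download"
        // ...slice this term's sub-range out of `_range` and write it out...
    }
}
```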